Some initial imports:
In [1]:
    
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
    
In [2]:
    
data = pd.read_csv('../data/data_with_problems.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)
    
    
    Out[2]:
Time to deal with the issues previously found.
Drop the duplicated rows (rows where all column values are the same); check the YPUQAPSOYJ row above. Let us use drop_duplicates to help us with that, keeping only the first of the duplicated rows.
In [3]:
    
mask_duplicated = data.duplicated(keep='first')
mask_duplicated.head(10)
    
    Out[3]:
In [4]:
    
data = data.drop_duplicates(keep='first')
print('Our dataset has now %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(10)
    
    
    Out[4]:
You could also consider a row a duplicate when only the age matches, by setting data.drop_duplicates(subset=['age'], keep='first'); in our case it would lead to the same result. Note that in general it is not a recommended programming practice to use the argument 'inplace=True' (e.g., data.drop_duplicates(subset=['age'], keep='first', inplace=True)), as it may lead to unexpected results.
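A minimal sketch of the difference between full-row and subset-based deduplication, using a hypothetical toy frame (the data and names here are ours, not from the dataset):

```python
import pandas as pd

# Two rows share the index 'x' and the same age, but differ in 'city'
df = pd.DataFrame({'age': [25, 25, 30], 'city': ['A', 'B', 'B']},
                  index=['x', 'x', 'y'])

# Full-row duplicates: none here, since the 'city' values differ
full = df.drop_duplicates(keep='first')

# With subset=['age'], rows count as duplicates when only 'age' matches,
# so just the first 25-year-old survives
by_age = df.drop_duplicates(subset=['age'], keep='first')

print(len(full), len(by_age))  # 3 2
```

Note that drop_duplicates compares column values only; the index is not part of the comparison.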
Missing values are one of the major, if not the biggest, data problems we face. There are several ways to deal with them, e.g.:
In [5]:
    
missing_data = data.isnull()
print('Number of missing values (NaN) per column/feature:')
print(missing_data.sum())
print('And we currently have %d rows.' % data.shape[0])
    
    
The amount of missing values is not bad enough to justify fully dropping a column/feature. If it were, the call would be data.drop('age', axis=1). The missing_data variable is our mask for the missing values:
In [6]:
    
missing_data.head(8)
    
    Out[6]:
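As a side note, a column is usually dropped only when most of its values are missing. A minimal sketch of that decision, with a hypothetical toy frame and an arbitrary 75% threshold (both are our assumptions, not part of the dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 30, np.nan],
                   'height': [1.7, 1.8, np.nan, 1.6]})

# Fraction of missing values per column (mean of the boolean mask)
missing_frac = df.isnull().mean()

# Drop only the columns above the threshold; here none qualifies
to_drop = missing_frac[missing_frac > 0.75].index
df_reduced = df.drop(columns=to_drop)
print(df_reduced.shape)  # (4, 2)
```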
Dropping the rows with missing values can be done with dropna(), for instance:
In [7]:
    
data_aux = data.dropna(how='any')
print('Dataset now with %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
    
    
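dropna() is more flexible than the how='any' call above; a small sketch on a hypothetical toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 30],
                   'height': [1.7, np.nan, np.nan]})

# how='any' drops a row if at least one value is missing
any_dropped = df.dropna(how='any')       # keeps only the first row
# how='all' drops a row only when every value is missing
all_dropped = df.dropna(how='all')       # drops only the middle row
# subset restricts the check to particular columns
sub_dropped = df.dropna(subset=['age'])  # keeps the first and last rows

print(len(any_dropped), len(all_dropped), len(sub_dropped))  # 1 2 2
```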
Filling the missing values with a constant can be done with fillna(), for instance:
In [8]:
    
data_aux = data.fillna(value=0)
print('Dataset has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))
    
    
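A single scalar like 0 is rarely right for every column; fillna() also accepts a dict mapping columns to replacement values. A sketch with a hypothetical toy frame (the 'unknown' label is our choice for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25.0, np.nan],
                   'gender': ['male', np.nan]})

# Per-column replacements: a numeric statistic for 'age',
# an explicit placeholder category for 'gender'
filled = df.fillna(value={'age': df['age'].mean(), 'gender': 'unknown'})
print(filled.isnull().sum().sum())  # 0
```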
So, what happened with our dataset? Let's take a look at where we had missing values before:
In [9]:
    
data_aux[missing_data['age']]
    
    Out[9]:
In [10]:
    
data_aux[missing_data['height']]
    
    Out[10]:
In [11]:
    
data_aux[missing_data['gender']]
    
    Out[11]:
Looks like what we did was not the most appropriate. For instance, we created a new category in the gender column:
In [12]:
    
data_aux['gender'].value_counts()
    
    Out[12]:
In [13]:
    
data['height'] = data['height'].replace(np.nan, data['height'].mean())
data[missing_data['height']]
    
    Out[13]:
In [14]:
    
data.loc[missing_data['age'], 'age'] = data['age'].median()
data[missing_data['age']]
    
    Out[14]:
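We used the mean for height but the median for age; the median is the more robust choice when a column may contain outliers, as a sketch with hypothetical toy values shows:

```python
import pandas as pd
import numpy as np

s = pd.Series([20, 22, 25, 27, 95, np.nan])  # 95 is an outlier

# The mean is pulled upwards by the outlier; the median is not
print(round(s.mean(), 1), s.median())  # 37.8 25.0

mean_filled = s.fillna(s.mean())      # imputes an inflated value
median_filled = s.fillna(s.median())  # imputes a typical value
```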
In [15]:
    
data['gender'].value_counts(dropna=False)
    
    Out[15]:
Let's replace MALE with male to harmonize our feature.
In [16]:
    
mask = data['gender'] == 'MALE'
data.loc[mask, 'gender'] = 'male'
# validate we don't have MALE:
data['gender'].value_counts(dropna=False)
    
    Out[16]:
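More generally, lower-casing the whole column handles every capitalization variant at once, not just MALE; a minimal sketch on hypothetical toy values:

```python
import pandas as pd

g = pd.Series(['male', 'MALE', 'female', 'Male'])

# str.lower() harmonizes all spelling variants (and leaves NaN as NaN)
harmonized = g.str.lower()
print(harmonized.value_counts().to_dict())  # {'male': 3, 'female': 1}
```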
Now we don't have the MALE entry anymore. Let us fill the missing values with the mode:
In [17]:
    
the_mode = data['gender'].mode()
# note that mode() returns a Series, not a scalar
the_mode
    
    Out[17]:
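mode() returns a Series rather than a scalar because the most frequent value may not be unique; that is why the next cell indexes it with [0]. A sketch with hypothetical toy values:

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'c'])

# Two values are tied for most frequent, so mode() returns both (sorted)
m = s.mode()
print(list(m))  # ['a', 'b']

# m[0] picks the first of them, which is what we use for imputation
print(m[0])  # a
```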
In [18]:
    
data['gender'] = data['gender'].replace(np.nan, data['gender'].mode()[0])
data[missing_data['gender']]
    
    Out[18]:
In [19]:
    
data.isnull().sum()
    
    Out[19]: